1 Summary


According  to  a  2013  Pew  Internet  Project  study  of  2076  people  [1],  91%  of  American  adults  own 
a  cellphone.  Increasingly,  people  are  using  their  phones  to  access  and  store  sensitive  data.  The 
same  study  found  that  81%  of  cellphone  owners  use  their  mobile  device  for  texting,  52%  use  it 
for  email,  49%  use  it  for  maps  (enabling  location  services),  and  29%  use  it  for  online  banking. 
And  yet,  securing  the  data  is  often  not  taken  seriously  because  of  an  inaccurate  estimation  of  risk 
as  discussed  in  [2].  In  particular,  several  studies  have  shown  that  a  large  percentage  of 
smartphone  owners  do  not  lock  their  phone:  57%  in  [3],  33%  in  [4],  39%  in  [2],  and  48%  in  this 
study. 

Active  authentication  is  an  approach  of  monitoring  the  behavioral  biometric  characteristics  of  a 
user’s  interaction  with  the  device  for  the  purpose  of  securing  the  phone  when  the  point-of-entry 
locking  mechanism  fails  or  is  absent.  In  recent  years,  continuous  authentication  has  been 
explored  extensively  on  desktop  computers,  based  either  on  a  single  biometric  modality  like 
mouse  movement  [5]  or  a  fusion  of  multiple  modalities  like  keyboard  dynamics,  mouse 
movement,  web  browsing,  and  stylometry  [6].  Unlike  physical  biometric  devices  like  fingerprint 
scanners  or  iris  scanners,  these  systems  rely  on  computer  interface  hardware  like  the  keyboard 
and  mouse  that  are  already  commonly  available  with  most  computers. 

In this report, we consider the problem of active authentication on mobile devices, where the variety of available sensor data is much greater than on the desktop, but so is the variety of behavioral profiles, device form factors, and environments in which the device is used. We study four representative modalities: stylometry (text analysis), application usage patterns, web browsing behavior, and physical location of the device. In the remainder of the report these four modalities are referred to as text, app, web, and location, respectively. We consider the trade-off between intruder detection time and detection error as measured by false accept rate (FAR) and false reject rate (FRR). The analysis is performed on a dataset, collected by the authors, of 200 subjects using their personal Android mobile devices for a period of at least 30 days. To the best of our knowledge, this dataset is the first of its kind studied in the active authentication literature, due to its large size [7], the duration of tracked activity [8], and the absence of restrictions on usage patterns and on the form factor of the mobile device. The geographical colocation of the participants, in particular, makes the dataset a good representation of an environment such as a closed-world organization where the unauthorized user of a particular device will most likely come from inside the organization.

We  propose  to  use  decision  fusion  in  order  to  asynchronously  integrate  the  four  modalities  and 
make  serial  authentication  decisions.  While  we  consider  here  a  specific  set  of  binary  classifiers, 
the  strength  of  our  decision-level  approach  is  that  additional  classifiers  can  be  added  without 
having  to  change  the  basic  fusion  rule.  Moreover,  it  is  easy  to  evaluate  the  marginal 
improvement  of  any  added  classifier  to  the  overall  performance  of  the  system.  We  evaluate  the 
multimodal  continuous  authentication  system  by  characterizing  the  error  rates  of  local  classifier 
decisions,  fused  global  decisions,  and  the  contribution  of  each  local  classifier  to  the  fused 
decision.  The  novel  aspects  of  our  work  include  the  scope  of  the  dataset,  the  particular  portfolio 
of  behavioral  biometrics  in  the  context  of  mobile  devices,  and  the  extent  of  temporal 
performance  analysis. 

The remainder of the report is structured as follows. In §2, we discuss the related work on
multimodal  biometric  systems,  active  authentication  on  mobile  devices,  and  each  of  the  four 
behavioral  biometrics  considered  in  this  report.  In  §3,  we  discuss  the  200  subject  dataset  that  we 
collected  and  analyzed.  In  §4,  we  discuss  four  biometric  modalities,  their  associated  classifiers, 
and  the  decision  fusion  architecture.  In  §5,  we  present  the  performance  of  each  individual 
classifier,  the  performance  of  the  fusion  system,  and  the  contribution  of  each  individual  classifier 
to  the  fused  decisions. 

2  Introduction 

2.1  Multimodal  Biometric  Systems 

The  window  of  time  based  on  which  an  active  authentication  system  is  tasked  with  making  a 
binary  decision  is  relatively  short  and  thus  contains  a  highly  variable  set  of  biometric 
information.  Depending  on  the  task  the  user  is  engaged  in,  some  of  the  biometric  classifiers  may 
provide more data than others. For example, as the user chats with a friend via SMS, the text-based classifiers will be actively flooded with data, while the web browsing based classifiers may
only  get  a  few  infrequent  events.  This  motivates  the  recent  work  on  multimodal  authentication 
systems  where  the  decisions  of  multiple  classifiers  are  fused  together  [9].  In  this  way,  the 
verification  process  is  more  robust  to  the  dynamic  nature  of  human-computer  interaction.  The 
current  approaches  to  the  fusion  of  classifiers  center  around  max,  min,  median,  or  majority  vote 
combinations  [10].  When  neural  networks  are  used  as  classifiers,  an  ensemble  of  classifiers  is 
constructed  and  fused  based  on  different  initialization  of  the  neural  network  [11]. 

Several active authentication studies have utilized multimodal biometric systems but, to the best of our knowledge, all of them (1) considered a smaller pool of subjects, (2) did not characterize the temporal performance of intruder detection, and (3) showed significantly worse overall performance than that achieved in our study.

Our approach in this report is to apply the Chair-Varshney optimal fusion rule [12] for the
combination  of  available  multimodal  decisions.  The  strength  of  the  decision-level  fusion 
approach  is  that  an  arbitrary  number  of  classifiers  can  be  added  without  re-training  the  classifiers 
already  in  the  system.  This  modular  design  allows  for  multiple  groups  to  contribute  drastically 
different  classification  schemes,  each  lowering  the  error  rate  of  the  global  decision. 


2.2  Mobile  Active  Authentication 

With  the  rise  of  smartphone  usage,  active  authentication  on  mobile  devices  has  begun  to  be 
studied  in  the  last  few  years.  The  large  number  of  available  sensors  makes  for  a  rich  feature 
space  to  explore.  Ultimately,  the  question  is  the  one  that  we  ask  in  this  report:  what  modality 
contributes  the  most  to  a  decision  fusion  system  toward  the  goal  of  fast,  accurate  verification  of 
identity?  Most  of  the  studies  focus  on  a  single  modality.  For  example,  gait  pattern  was 
considered  in  [7]  achieving  an  EER  of  0.201  (20.1%)  for  51  subjects  during  two  short  sessions, 
where  each  subject  was  tasked  with  walking  down  a  hallway.  Some  studies  have  incorporated 
multiple  modalities.  For  example,  keystroke  dynamics,  stylometry,  and  behavioral  profiling  were 
considered in [13] achieving an EER of 0.033 (3.3%) from 30 simulated users. The data for these users was pieced together from different datasets. To the best of our knowledge, the dataset that we collected and analyzed is unique in all its key aspects: its size (200 subjects), its duration (30+ days), and the size of the portfolio of modalities that were all tracked concurrently with a synchronized timestamp.

2.3  Stylometry,  Web  Browsing,  Application  Usage,  Location 

Stylometry is the study of linguistic style. It has been extensively applied to the problems of authorship attribution, identification, and verification. See [14] for a thorough summary of stylometric studies in each of these three problem domains along with their study parameters and the resulting accuracy. These studies traditionally use large sets of features (see Table 11 in [15]) in combination with support vector machines (SVMs) that have proven to be effective in high dimensional feature space [16], even in cases when the number of features exceeds the number of samples. Nevertheless, with these approaches, often more than 500 words are required in order to achieve adequately low error rates [17]. This makes them impractical for the application of real-time active authentication on mobile devices where text data comes in short bursts.

While the other three modalities are not well investigated in the context of active authentication, this is not true for stylometry. Therefore, for this modality, we do not reinvent the wheel, and instead implement the n-gram analysis approach presented in [14], which has been shown to work sufficiently well on short blocks of text.

Web browsing, application usage, and location have not been studied extensively in the context of active authentication. The following is a discussion of the few studies that we are aware of. Web browsing behavior has been studied for the purpose of understanding user behavior, habits, and interests [18]. Web browsing as a source for behavioral biometric data was considered in [19] to achieve average identification FAR/FRR of 0.24 (24%) on a dataset of 14 desktop computer users. Application usage was considered in [8], where cellphone data (from 2004) from the MIT Reality Mining project [20] was used to achieve 0.1 (10%) EER based on a portfolio of metrics including application usage, call patterns, and location. Application usage and movement patterns have been studied as part of behavioral profiling in cellular networks [8,21,22]. However, these approaches use position data of lower resolution in time and space than that provided by GPS on smartphones. To the best of our knowledge, GPS traces have not been utilized in the literature for continuous authentication.


3  Methods 

The dataset used in this work contains behavioral biometrics data for 200 subjects. The collection of the data was carried out by the authors over a period of 5 months. The requirements of the study were that each subject was a student or employee of Drexel University and was an owner and an active user of an Android smartphone or tablet. The number of subjects with each major Android version and associated API level is listed in Table 1. The Nexus 5 was the most popular device, with 10 subjects using it. The Samsung Galaxy S5 was the second most popular device, with 6 subjects using it.


Table 1: The Android version and API level of the 200 devices that were part of the study.

Android Version    API Level    Subjects
4.4                19           143
4.1                16           16
4.3                18           15
4.2                17           9
4.0.4              15           5
2.3.6              10           4
4.0.3              15           3
2.3.5              10           3
2.2                8            2

A tracking application was installed on each subject’s device and operated for a period of at least 30 days until the subject came in to approve the collected data and get the tracking application uninstalled from their device. The following data modalities were tracked with 1-second resolution:

•  Text  typed  via  soft  keyboard. 

•  Apps  visited. 

•  Websites  visited. 

•  Location  (based  on  GPS  or  WiFi). 

The  key  characteristics  of  this  dataset  are  its  large  size  (200  users),  the  duration  of  tracked 
activity  (30+  days),  and  the  geographical  colocation  of  its  participants  in  the  Philadelphia  area. 
Moreover,  we  did  not  place  any  restrictions  on  usage  patterns,  on  the  type  of  Android  device,  and 
on  the  Android  OS  version  (see  Table  1). 

There  were  several  challenges  encountered  in  the  collection  of  the  data.  The  biggest  problem  was 
battery  drain.  Due  to  the  long  duration  of  the  study,  we  could  not  enable  modalities  whose 
tracking proved to be significantly draining of battery power. These modalities include front-facing video for eye tracking and face recognition, gyroscope, accelerometer, and touch gestures.
Moreover,  we  had  to  reduce  GPS  sampling  frequency  to  once  per  minute  on  most  of  the  devices. 

Table 2: The number of events in the dataset associated with each of the four modalities considered in this report.

Event       Frequency
Text        23,254,478
App         927,433
Web         210,322
Location    143,875

A text event refers to a single character entered on the soft keyboard. An app event refers to a new app receiving focus. A web event refers to a new URL entered in the URL box. A location event refers to a new sample of the device location either from GPS or WiFi.

Table 2 shows statistics on each of the four investigated modalities in the corpus. The table contains data aggregated over all 200 users. The “frequency” here is a count of the number of instances of an action associated with that modality. As stated previously, the four modalities are referred to as text, app, web, and location. For text, the action is a single keystroke on the soft keyboard. For app, the action is opening or bringing focus to a new app. For web, the action is visiting a new website. For location, no explicit action is taken by the user. Rather, location is sampled regularly at intervals of 1 minute when GPS is enabled. As Table 2 suggests, text events fire 1-2 orders of magnitude more frequently than the other three.
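
The "1-2 orders of magnitude" statement can be checked directly from the counts in Table 2. The short sketch below only does that arithmetic; the dictionary layout and variable names are illustrative assumptions, not part of the study's code.

```python
import math

# Event counts copied from Table 2.
counts = {"text": 23254478, "app": 927433, "web": 210322, "location": 143875}

# Orders of magnitude by which text events outnumber each other modality.
orders = {
    modality: math.log10(counts["text"] / count)
    for modality, count in counts.items()
    if modality != "text"
}

for modality, value in orders.items():
    print(f"text vs {modality}: {value:.2f} orders of magnitude")
```

The ratios come out at roughly 1.4 (app), 2.0 (web), and 2.2 (location) orders of magnitude.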

The data for each user is processed to remove idle periods when the device is not active. The threshold for what is considered an idle period is 5 minutes. For example, if the time between event A and event B is 20 minutes, with no other events in between, this 20-minute gap is compressed down to 5 minutes. The date and time of the event are not changed, but the timestamp used in dividing the dataset for training and testing (see §5.1) is updated to reflect the new time between event A and event B. This compression of idle times is performed in order to regularize periods of activity for the cross-validation that utilizes time-based windows, as described in §5.1. The resulting compressed timestamps are referred to as “active interaction”. Fig. 1 shows the duration (in hours) of active interaction for each of the 200 users ordered from least to most active.
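
The idle-period compression described above can be sketched as follows. This is a minimal illustration, assuming sorted event timestamps in seconds and an "active interaction" clock that starts at zero; the function name `compress_idle` and its parameters are hypothetical and not taken from the study's implementation.

```python
def compress_idle(timestamps, max_gap=300):
    """Map sorted event timestamps (seconds) to 'active interaction' time.

    Any gap between consecutive events longer than max_gap seconds
    (5 minutes by default) is shortened to exactly max_gap; shorter
    gaps are kept unchanged. The first event is placed at time 0
    (an assumption made for this sketch).
    """
    if not timestamps:
        return []
    compressed = [0]
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = min(curr - prev, max_gap)
        compressed.append(compressed[-1] + gap)
    return compressed


# Events at 0 s and 60 s, then a 20-minute pause, then one more minute.
events = [0, 60, 60 + 20 * 60, 60 + 20 * 60 + 60]
active = compress_idle(events)
print(active)
```

In this example the 20-minute pause contributes only 300 seconds to the active-interaction clock, while the two 60-second gaps are preserved.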

Table 3 shows three top-20 lists: (1) the top-20 apps based on the amount of text that was typed inside each app, (2) the top-20 apps based on the number of times they received focus, and (3) the top-20 website domains based on the number of times a website associated with that domain was visited. These are aggregate measures across the dataset intended to provide an intuition about its structure and content; a top-20 list of the same form is used for the classifier model based on the web and app features in §4.

Fig. 2 shows a heat map visualization of a selection from the dataset of GPS locations in the Philadelphia area. The subjects in the study resided in Philadelphia but traveled all over the United States and the world. There are two key characteristics of the GPS location data. First, it is relatively unique to each individual even for people living in the same area of a city. Second, outside of occasional travel, it does not vary significantly from day to day. Human beings are creatures of habit, and inasmuch as location is a measure of habit, this idea is confirmed by the location data of the majority of the subjects in the study.

Table 3: Top 20 apps ordered by text entry and visit frequency and top 20 websites ordered by visit frequency. These tables are provided to give insight into the structure and content of the dataset.

(a)
App Name                          Keys Per App
com.android.sms                   5,617,297
com.android.mms                   5,552,079
com.whatsapp                      4,055,622
com.facebook.orca                 1,252,456
com.google.android.talk           1,147,295
com.infraware.polarisviewerd      990,319
com.android.chrome                417,165
com.facebook.katana               405,267
com.snapchat.android              377,840
com.google.android.gm             271,570
com.htc.sense.mms                 238,300
com.tencent.mm                    221,461
com.motorola.messaging            203,649
com.android.calculator2           167,435
com.verizon.messaging.vzmsgs      137,339
com.groupme.android               134,896
com.handcent.nextsms              123,065
com.jb.gosms                      118,316
com.sonyericsson.conversations    114,219
com.twitter.android               92,605

(b)
App Name         Visits
TouchWiz home    101,151
WhatsApp         64,038
Messaging        60,015
Launcher         39,113
Facebook         38,591
Google Search    32,947
Chrome           32,032
Snapchat         23,481
System UI        22,772
Phone            19,396
Gmail            19,329
Messages         19,154
Contacts         18,668
Hangouts         17,209
Home             16,775
HTC Sense        16,325
YouTube          14,552
Xperia Home      13,639
Instagram        13,146
Settings         12,675

(c)
Website Domain               Visits
www.google.com               19,004
m.facebook.com               9,300
www.reddit.com               4,348
forums.huaren.us             3,093
learn.dcollege.net           2,133
en.m.wikipedia.org           1,825
mail.drexel.edu              1,520
one.drexel.edu               1,472
login.drexel.edu             1,462
likes.com                    1,361
mail.google.com              1,292
i.imgur.com                  1,132
www.amazon.com               1,079
netcontrol.irt.drexel.edu    1,049
www.facebook.com             903
banner.drexel.edu            902
m.hupu.com                   824
t.co                         801
duapp2.drexel.edu            786
m.ign.com                    725

Figure 1: The duration of time (in hours) that each of the 200 users actively interacted with their device.

4 Assumptions and Procedures


4.1  Features  and  Classifiers 

The  four  distinct  biometric  modalities  considered  in  our  analysis  are  (1)  text  entered  via  soft 
keyboard,  (2)  applications  used,  (3)  websites  visited,  and  (4)  physical  location  of  the  device  as 
determined  from  GPS  (when  outdoors)  or  WiFi  (when  indoors).  We  refer  to  these  four  modalities 
as  text,  app,  web,  and  location,  respectively.  In  this  section  we  discuss  the  features  that  were 
extracted  from  the  raw  data  of  each  modality,  and  the  classifiers  that  were  used  to  map  these 
features  into  binary  decision  space. 

A binary classifier is constructed for each of the 200 users and 4 modalities. In total, there are 800 classifiers, each producing either a probability that the user is valid, $P(H_1)$, or a binary decision of 0 (invalid) or 1 (valid). The first class ($H_1$) for each classifier is trained on the valid user’s data and the second class ($H_0$) is trained on the other 199 users’ data. The training process is described in more detail in §5.1. For app, web, and location, the classifier takes a single instance of the event and produces a probability. For multiple events of the same modality, the set of probabilities is fused across time using maximum likelihood:


$$H^{*} = \arg\max_{i \in \{0,1\}} \prod_{x_t \in \Omega} P(x_t \mid H_i), \tag{1}$$

where $\Omega = \{x_t \mid T_{\mathrm{current}} - T(x_t) \le \omega\}$, $\omega$ is a fixed window size in seconds, $T(x_t)$ is the timestamp of event $x_t$, and $T_{\mathrm{current}}$ is the current timestamp. The process of fusing classifier scores across time is illustrated in Fig. 3.
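
The temporal fusion in (1) can be sketched in a few lines. This is an illustrative sketch, not the authors' C++ implementation: the function name `fuse_over_window`, the tuple layout of `events`, the `eps` floor on likelihoods, and the choice to break ties in favor of $H_0$ are all assumptions made here. Sums of logarithms are used instead of the raw product to avoid numerical underflow; the maximizing hypothesis is the same.

```python
import math


def fuse_over_window(events, t_current, window, eps=1e-9):
    """Pick the hypothesis maximizing the likelihood product in (1).

    events: list of (timestamp, p_h0, p_h1), where p_h0 and p_h1 are the
            likelihoods of that event under H0 (invalid) and H1 (valid).
    Only events no older than `window` seconds at t_current are used.
    Returns 1 for H1 and 0 for H0; ties go to H0 (an assumption).
    """
    log_l0 = 0.0
    log_l1 = 0.0
    for t, p_h0, p_h1 in events:
        if 0 <= t_current - t <= window:
            # eps keeps log() finite when a likelihood is exactly zero.
            log_l0 += math.log(max(p_h0, eps))
            log_l1 += math.log(max(p_h1, eps))
    return 1 if log_l1 > log_l0 else 0


events = [(10, 0.9, 0.01), (100, 0.2, 0.6), (130, 0.5, 0.4)]
short_window = fuse_over_window(events, t_current=140, window=60)
long_window = fuse_over_window(events, t_current=140, window=200)
print(short_window, long_window)
```

With the 60-second window only the two most recent events count and $H_1$ wins; widening the window to 200 seconds admits the older event that strongly favors $H_0$ and flips the decision.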

Figure 2: An aggregate heatmap showing a selection from the dataset of GPS locations in the Philadelphia area.


4.1.1  Text 

As Table 3a indicates, the apps into which text was entered on mobile devices varied, but the
activity in the majority of cases was communication via SMS, MMS, WhatsApp, Facebook,
Google  Hangouts,  and  other  chat  apps.  Therefore,  text  events  fired  in  short  bursts.  The  tracking 
application  captured  the  keys  that  were  touched  on  the  keyboard  and  not  the  autocorrected  result. 
Therefore,  the  majority  of  the  typed  messages  had  a  lot  of  misspellings  and  words  that  were 
erased  in  the  final  submitted  message.  In  the  case  of  SMS,  we  also  were  able  to  record  the 
submitted  result.  For  example,  an  SMS  text  that  was  submitted  as  “Sorry  couldn't  call  back.”  had 
associated  with  it  the  following  recorded  keystrokes:  “Sprry  coyld  cpuldn't  vsll  back.” 
Classification  based  on  the  actual  typed  keys  in  principle  is  a  better  representation  of  the  person’s 
linguistic  style.  It  captures  unique  typing  idiosyncrasies  that  autocorrect  can  conceal.  As 
discussed  in  §2,  we  implemented  a  one-feature  n-gram  classifier  from  [14]  that  has  been  shown 
to  work  well  on  short  messages.  It  works  by  analyzing  the  presence  or  absence  of  n-grams  with 
respect  to  the  training  set. 
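
To make the "presence or absence of n-grams" idea concrete, the sketch below scores a short message by the fraction of its distinct character n-grams that also occur in a user's training text. This is only an illustration under stated assumptions: the use of character trigrams, the overlap score, and the names `char_ngrams`, `build_profile`, and `overlap_score` are chosen here and are not taken from [14] or from the report's implementation.

```python
def char_ngrams(text, n=3):
    """Return the set of distinct character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_profile(training_messages, n=3):
    """Collect every n-gram seen in a user's training messages."""
    profile = set()
    for message in training_messages:
        profile |= char_ngrams(message, n)
    return profile


def overlap_score(message, profile, n=3):
    """Fraction of the message's n-grams that are present in the profile."""
    grams = char_ngrams(message, n)
    if not grams:
        return 0.0  # message shorter than n characters
    return len(grams & profile) / len(grams)


# Raw keystrokes (before autocorrect) keep the user's typing idiosyncrasies.
profile = build_profile(["sprry coyld cpuldn't vsll back."])
print(overlap_score("sprry", profile), overlap_score("sorry", profile))
```

A score near 1 means the message reuses n-grams the valid user has typed before; how such a score is turned into a probability or a decision threshold is left open here.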

4.1.2 App and Web


The  app  and  web  classifier  models  we  construct  are  identical  in  their  structure.  For  the  app 
modality  we  use  the  app  name  as  the  unique  identifier  and  count  the  number  of  times  a  user  visits 
each  app  in  the  training  set.  For  the  web  modality  we  use  the  domain  of  the  URL  as  the  unique 
identifier  and  count  the  number  of  times  a  user  visits  each  domain  in  the  training  set.  Note  that, 
for example, “m.facebook.com” is considered a different domain than “www.facebook.com”
because  the  subdomain  is  different.  In  this  section  we  refer  to  the  app  name  and  the  web  domain 
as  an  “entity”.  Table  3b  and  Table  3c  show  the  top  entities  aggregated  across  all  200  users  for 
app  and  web  respectively. 

For each user, the classification model for the valid class is constructed by determining the top 20 entities visited by that user in the training set. The quantity of visits is then normalized so that the 20 frequency values sum to 1. The classification model for the invalid class is constructed by counting the number of visits by the other 199 users to those same 20 entities, such that for each of those entities we now have a probability that a valid user visits it and a probability that an invalid user visits it. The evaluation for each user given the two empirical distributions is performed by the maximum likelihood product in (1). Entities that do not appear in the top 20 are considered outliers and are ignored in this classifier.
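
The construction of the two empirical distributions can be sketched as below. It is a minimal illustration, assuming events are given as plain lists of entity names; the function name `build_entity_model` is hypothetical, and normalizing the invalid-class counts so that they also sum to 1 is an assumption of this sketch rather than something the text states explicitly.

```python
from collections import Counter


def build_entity_model(valid_events, other_events, top_k=20):
    """Build valid/invalid visit distributions over the user's top entities.

    valid_events: entity names (app names or web domains) visited by the
                  valid user in the training set.
    other_events: entity names visited by all other users.
    Returns (p_valid, p_invalid), two dicts keyed by the top_k entities.
    Entities outside the top_k are treated as outliers and dropped.
    """
    top = [entity for entity, _ in Counter(valid_events).most_common(top_k)]
    top_set = set(top)
    valid_counts = Counter(e for e in valid_events if e in top_set)
    other_counts = Counter(e for e in other_events if e in top_set)

    def normalize(counts):
        total = sum(counts.values())
        return {e: (counts[e] / total if total else 0.0) for e in top}

    return normalize(valid_counts), normalize(other_counts)


valid = ["a", "a", "a", "b"]
others = ["a", "b", "b", "b", "c"]
p_valid, p_invalid = build_entity_model(valid, others, top_k=2)
print(p_valid, p_invalid)
```

Each observed entity then contributes the pair (`p_invalid[e]`, `p_valid[e]`) to the likelihood product in (1); in practice a small floor on zero probabilities would be needed before taking logarithms.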

4.1.3  Location 

Location  is  specified  as  a  pair  of  values:  latitude  and  longitude.  Classification  is  performed  using 
support  vector  machines  (SVMs)  [23]  with  the  radial  basis  function  (RBF)  as  the  kernel  function. 
The  SVM  produces  a  classification  score  for  each  pair  of  latitude  and  longitude.  This  score  is 
calibrated  to  form  a  probability  using  Platt  scaling  [24]  which  requires  an  extra  logistic 
regression  on  the  SVM  scores  via  an  additional  cross-validation  on  the  training  data.  All  of  the 
code  in  this  report  is  written  by  the  authors  except  for  the  SVM  classifier.  Since  the 
authentication  system  is  written  in  C++,  we  used  the  Shark  3.0  machine  learning  library  for  the 
SVM  implementation. 

4.2 Decision Fusion

Figure 3: The fusion architecture across time and across classifiers. The text, app, web, and location boxes indicate a firing of a single event associated with each of those modalities. Multiple classifier scores from the same modality are fused via (1) to produce a single local binary decision. Local binary decisions from each of the four modalities are fused via (4) to produce a single global binary decision.

Decision fusion with distributed sensors is described by Tenney and Sandell in [25], who studied a parallel decision architecture. As described in [26], the system comprises $n$ local detectors, each making a decision about a binary hypothesis ($H_0$, $H_1$), and a decision fusion center (DFC) that uses these local decisions $\{u_1, u_2, \ldots, u_n\}$ for a global decision about the hypothesis. The $i$th detector collects $K$ observations before it makes its decision, $u_i$. The decision is $u_i = 1$ if the detector decides in favor of $H_1$ and $u_i = -1$ if it decides in favor of $H_0$. The DFC collects the $n$ decisions of the local detectors and uses them in order to decide in favor of $H_0$ ($u = -1$) or in favor of $H_1$ ($u = 1$). Tenney and Sandell [25] and Reibman and Nolte [27] studied the design of the local detectors and the DFC with respect to a Bayesian cost, assuming the observations are independent conditioned on the hypothesis. The ensuing formulation derived the local and DFC decision rules to be used by the system components for optimizing the system-wide cost. The resulting design requires the use of likelihood ratio tests by the decision makers (local detectors and DFC) in the system. However, the thresholds used by these tests require the solution of a set of nonlinear coupled differential equations. In other words, the designs of the local decision makers and the DFC are co-dependent. In most scenarios the resulting complexity renders the quest for an optimal design impractical.

Chair and Varshney in [12] developed the optimal fusion rule when the local detectors are fixed and local observations are statistically independent conditioned on the hypothesis. The decision fusion center is optimal given the performance characteristics of the fixed local decision makers. The result is a suboptimal (since local detectors are fixed) but computationally efficient and scalable design. In this study we use the Chair-Varshney formulation. The parallel distributed fusion scheme (see Fig. 3) allows each classifier to observe an event, minimize the local risk, and make a local decision over the set of hypotheses, based only on its own observations. Each classifier sends out a decision of the form:

$$u_i = \begin{cases} 1, & \text{if } H_1 \text{ is decided} \\ -1, & \text{if } H_0 \text{ is decided} \end{cases} \tag{2}$$



The  fusion  center  combines  these  local  decisions  by  minimizing  the  global  Bayes’  risk.  The 
optimum  decision  rule  performs  the  following  likelihood  ratio  test 


$$\frac{P(u_1, \ldots, u_n \mid H_1)}{P(u_1, \ldots, u_n \mid H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{P_0}{P_1} = \tau \tag{3}$$


where the a priori probabilities of the binary hypotheses $H_1$ and $H_0$ are $P_1$ and $P_0$ respectively. In this case the general fusion rule proposed in [12] is


(4) 


with $P_i^M$ and $P_i^F$ representing the False Rejection Rate (FRR) and False Acceptance Rate (FAR) of the $i$th classifier, respectively. The optimum weights minimizing the global probability of error are given by


(5) 


(6) 


The threshold in (3) requires knowledge of the a priori probabilities of the hypotheses. In practice, these probabilities are not available, and the threshold $\tau$ is determined using different considerations such as fixing the probability of false alarm or false rejection as is done in §5.3.
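
For reference, the fusion rule of [12] is commonly stated in the following form. The weight symbols $a_0$ and $a_i$ are notation chosen for this sketch and may differ from the report's own; $P_i^M$ and $P_i^F$ denote the FRR and FAR of the $i$th classifier, and $P_1$, $P_0$ are the priors of $H_1$ and $H_0$.

$$f(u_1, \ldots, u_n) = \begin{cases} 1, & \text{if } a_0 + \sum_{i=1}^{n} a_i u_i > 0 \\ -1, & \text{otherwise} \end{cases} \qquad a_0 = \log\frac{P_1}{P_0}, \qquad a_i = \begin{cases} \log\frac{1 - P_i^M}{P_i^F}, & \text{if } u_i = 1 \\ \log\frac{1 - P_i^F}{P_i^M}, & \text{if } u_i = -1 \end{cases}$$

A minimal Python sketch of this rule follows. The function name `chair_varshney`, its argument layout, and the default prior of 0.5 are assumptions for illustration; in the report the threshold is set by other considerations (§5.3) rather than by known priors.

```python
import math


def chair_varshney(decisions, far, frr, p1=0.5):
    """Fuse local +1/-1 decisions with the Chair-Varshney weighted sum.

    decisions: local decisions u_i, each +1 (valid) or -1 (invalid).
    far, frr:  per-classifier false accept / false reject rates, each
               strictly between 0 and 1 (estimated on held-out data).
    p1:        assumed prior probability of H1 (valid user).
    Returns the global decision, +1 or -1.
    """
    total = math.log(p1 / (1.0 - p1))  # a_0
    for u, p_f, p_m in zip(decisions, far, frr):
        if u == 1:
            weight = math.log((1.0 - p_m) / p_f)
        else:
            weight = math.log((1.0 - p_f) / p_m)
        total += weight * u  # a_i * u_i
    return 1 if total > 0 else -1


# One reliable classifier says "valid"; two weak ones say "invalid".
reliable_wins = chair_varshney([1, -1, -1], far=[0.05, 0.3, 0.3], frr=[0.05, 0.3, 0.3])
# With equal error rates the rule reduces to a majority vote.
majority = chair_varshney([1, -1, -1], far=[0.2, 0.2, 0.2], frr=[0.2, 0.2, 0.2])
print(reliable_wins, majority)
```

The first call shows why the weights matter: a classifier with low error rates can outvote two noisier ones, which a plain majority vote would not allow.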

5  Results  and  Discussion 

5.1  Training,  Characterization,  Testing 

The data of each of the 200 users’ active interaction with the mobile device was divided into 5 equal-size folds (each containing a 20% time span of the full set). We performed training of each classifier on the first three folds (60%). We then tested their performance on the fourth fold. This phase is referred to as “characterization”, because its sole purpose is to form estimates of FAR and FRR for use by the fusion algorithm. We then tested the performance of the classifiers, individually and as part of the fusion system, on the fifth fold. This phase is referred to as “testing” since this is the part that is used for evaluating the performance of the individual classifiers and the fusion system. The three phases of training, characterization, and testing as they relate to the data folds are shown in Fig. 4.

•  Training on folds 1, 2, 3; characterization on fold 4; testing on fold 5.

•  Training on folds 2, 3, 4; characterization on fold 5; testing on fold 1.

•  Training on folds 3, 4, 5; characterization on fold 1; testing on fold 2.

•  Training on folds 4, 5, 1; characterization on fold 2; testing on fold 3.

•  Training on folds 5, 1, 2; characterization on fold 3; testing on fold 4.
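
The five fold assignments listed above follow a simple rotation, which can be generated as in the sketch below. The function name `fold_rotations` and the dictionary keys are illustrative choices, not names from the report's code.

```python
def fold_rotations(n_folds=5, n_train=3):
    """List the train / characterization / test fold assignments.

    Folds are numbered from 1. Each rotation starts one fold later than
    the previous one and wraps around, so every fold is used exactly
    once for characterization and once for testing.
    """
    rotations = []
    for start in range(n_folds):
        order = [(start + k) % n_folds + 1 for k in range(n_folds)]
        rotations.append({
            "train": order[:n_train],
            "characterize": order[n_train],
            "test": order[n_train + 1],
        })
    return rotations


rotations = fold_rotations()
for r in rotations:
    print(r)
```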


[Figure 4 diagram: Class 1 (Accept) is built from the valid user's data (User 3 in the
illustration) and Class 2 (Reject) from the data of other users (User 1, User 2, ..., User 67);
for every user, 60% of the data is used for training, 20% for characterization, and 20% for
testing.]


Figure 4: The three phases of processing the data to determine the individual performance of
each classifier and the performance of the fusion system that combines some subset of these
classifiers.




The common evaluation method used with each classifier and with data fusion was to measure error
rates averaged across five experiments. In each experiment, 3 folds were used for training, 1 fold
for characterization, and 1 fold for testing. The FAR and FRR computed during characterization
were taken as input to the fusion system as a measure of the expected performance of the
classifiers. Each experiment therefore consisted of three phases: 1) train the classifier(s) using
the training set, 2) determine FAR and FRR based on the characterization set, and 3) classify the
windows in the test set.
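As a rough sketch of the bookkeeping behind these error rates, FAR and FRR can be computed from accept/reject decisions paired with ground truth and then averaged over the five experiments. The function names and the sample decisions below are hypothetical and are not data from the study.

```python
def far_frr(decisions, is_genuine):
    """Return (FAR, FRR) for paired decisions and ground-truth labels.

    decisions[k]  -- True if window k was accepted as the valid user
    is_genuine[k] -- True if window k really came from the valid user
    """
    pairs = list(zip(decisions, is_genuine))
    false_accepts = sum(1 for d, g in pairs if d and not g)
    false_rejects = sum(1 for d, g in pairs if g and not d)
    n_impostor = sum(1 for _, g in pairs if not g)
    n_genuine = sum(1 for _, g in pairs if g)
    # A rate is undefined when its class is empty; it is reported as 0.0 here.
    far = false_accepts / n_impostor if n_impostor else 0.0
    frr = false_rejects / n_genuine if n_genuine else 0.0
    return far, frr


def average_rates(per_experiment):
    """Average a list of (FAR, FRR) pairs, one pair per experiment."""
    n = len(per_experiment)
    return (sum(far for far, _ in per_experiment) / n,
            sum(frr for _, frr in per_experiment) / n)


# Made-up decisions for one experiment: four genuine windows, four impostor windows.
decisions = [True, True, False, True, False, False, True, False]
is_genuine = [True, True, True, True, False, False, False, False]
far, frr = far_frr(decisions, is_genuine)
print(far, frr)
```

In the scheme described above, the same computation would be run twice per experiment: once on the characterization fold (to feed the fusion system) and once on the test fold (to report performance).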

5.2  Performance:  Individual  Classifiers 

An active authentication system has two conflicting objectives: response time and performance.
The less time the system waits before making an authentication decision, the higher the expected
rate of error. As more behavioral biometric data trickles in, the system can, on average, make a
classification decision with greater certainty.

This pattern of decreased error rates with an increased decision window can be observed in Fig. 5,
which shows (for 10 different time windows) the FAR and FRR of the 4 classifiers averaged over
the 200 users, with the error bars indicating the standard deviation. The “testing fold” (see §5.1)
is used for computing these error rates. The “characterization fold” does not affect these results;
it is used only for the FAR/FRR estimation required by the decision fusion center in §5.3.

The  “time  before  decision”  is  the  time  between  the  first  event  indicating  activity  and  the  first 
decision  produced  by  the  fusion  system.  This  metric  can  be  thought  of  as  “decision  window 
size”.  Events  older  than  the  time  range  covered  by  the  time-window  are  disregarded  in  the 
classification.  If  no  event  associated  with  the  modality  under  consideration  fires  in  a  specific 
time  window,  no  error  is  added  to  the  average. 
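The two rules in this paragraph (discard events older than the window, and skip windows in which the modality did not fire) can be sketched as follows. The half-open interval, the use of seconds, and the `None` convention for an empty window are assumptions made for illustration.

```python
def events_in_window(timestamps, now, window_seconds):
    """Return the event timestamps that fall inside (now - window_seconds, now]."""
    return [t for t in timestamps if now - window_seconds < t <= now]


def mean_error_over_active_windows(window_errors):
    """Average per-window errors, skipping windows in which no event fired.

    window_errors -- one entry per window: 1 for a wrong decision, 0 for a
                     correct one, or None when the modality produced no event.
    Returns None when every window was empty.
    """
    active = [e for e in window_errors if e is not None]
    return sum(active) / len(active) if active else None


# Hypothetical event times in seconds; only events from the last 60 s are kept.
timestamps = [5, 50, 70, 400]
recent = events_in_window(timestamps, now=100, window_seconds=60)

# Six windows, two of them empty: the average is taken over the other four.
avg = mean_error_over_active_windows([1, None, 0, None, 0, 0])
print(recent, avg)
```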

Table 4: The rates at which an event associated with each modality “fires” per hour. On average,
GPS location is provided only 3.5 times an hour.

Event       Firing Rate (per hour)
Text        557.8
App         23.2
Web         5.6
Location    3.5


There are two notable observations about the FAR/FRR plots in Fig. 5. First, the location
modality provides the lowest error rates even though, on average across the dataset, it fires only
3.5 times an hour, as shown in Table 4. This means that classification on a single GPS coordinate
is sufficient to correctly verify the user with an FAR of under 0.1 and an FRR of under 0.05.
Second, the text modality converges to an FAR of 0.16 and an FRR of 0.11 after 30 minutes,
which makes it one of the worse performers of the four modalities, even though it fires 557.8
times an hour on average. At the 30 minute mark, that firing rate equates to an average text block
size of 279 characters. An FAR/FRR of 0.16/0.11 with 279-character blocks improves on the error
rates achieved in [14] with 500-character blocks, which in turn improved on the error rates
achieved in prior work for short blocks of text (see [14] for a full reference list on short-text
stylometric analysis).
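The 279-character figure follows directly from the firing rates in Table 4. The short sketch below reproduces that arithmetic, treating each text event as one character (as the paragraph above does) and assuming, purely for illustration, that events are spread evenly over the hour. The helper name `expected_events` is hypothetical.

```python
# Firing rates per hour, copied from Table 4.
FIRING_RATE_PER_HOUR = {"Text": 557.8, "App": 23.2, "Web": 5.6, "Location": 3.5}


def expected_events(rate_per_hour, window_minutes):
    """Expected number of events in a window, assuming a uniform rate."""
    return rate_per_hour * window_minutes / 60.0


text_block = expected_events(FIRING_RATE_PER_HOUR["Text"], 30)
location_events = expected_events(FIRING_RATE_PER_HOUR["Location"], 30)
print(round(text_block))   # 279
print(location_events)     # about 1.75 location fixes in 30 minutes
```

The same calculation shows how sparse the location modality is: fewer than two fixes are expected in a 30 minute window.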

In  addition  to  the  main  four  features  under  consideration  in  this  report,  we  evaluated  the 
contribution  of  eye  tracking  and  an  alternate  stylometry  feature-set.  The  EER  of  the  eye  tracking 
metric  was  0.12  and  the  EER  of  the  alternate  stylometry  metric  was  0.26. 

5.3  Performance:  Decision  Fusion 

The  events  associated  with  each  of  the  4  modalities  fire  at  very  different  rates  as  shown  in  Table 
4.  Moreover,  text  events  fire  in  bursts,  while  the  location  events  fire  at  regularly  spaced  intervals 
when  GPS  signal  is  available.  The  app  and  web  events  fire  at  varying  degrees  of  burstiness 
depending  on  the  user.  Fig.  6  shows  the  distribution  of  the  number  of  events  that  fire  within  each 
of the time windows. An important takeaway from these distributions is that most events come in
bursts followed by periods of inactivity. This results in the counterintuitive fact that the 1 minute,
10 minute, and 30 minute windows have a similar distribution of the number of events that fire
within them. This is why the decrease in error rates attained from waiting longer for a decision is
not as significant as might be expected.
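A small synthetic example illustrates why bursty activity produces similar event-count distributions for different window sizes. The timestamps below are made up, and the fixed, non-overlapping windows are an assumption; the windowing used to produce Fig. 6 may differ.

```python
from collections import Counter


def events_per_window_histogram(timestamps, window_seconds):
    """Map 'events in a window' -> 'number of windows with that count'.

    Windows are consecutive and non-overlapping. Windows in which no event
    fires never appear, because only windows containing events are counted.
    """
    per_window = Counter(int(t // window_seconds) for t in timestamps)
    return Counter(per_window.values())


# Two synthetic bursts (seconds): three events, an hour of silence, four events.
timestamps = [3, 7, 12, 3605, 3610, 3618, 3622]

hist_1min = events_per_window_histogram(timestamps, 60)
hist_10min = events_per_window_histogram(timestamps, 600)
hist_30min = events_per_window_histogram(timestamps, 1800)
print(hist_1min, hist_10min, hist_30min)
```

Because each burst fits inside even the shortest window and the bursts are far apart, widening the window adds no new events, so all three histograms coincide.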

Asynchronous  fusion  of  classification  of  events  from  each  of  the  four  modalities  is  robust  to  the 
irregular  rates  at  which  events  fire.  The  decision  fusion  rule  in  (4)  utilizes  all  the  available 
biometric  data,  weighing  each  classifier  according  to  its  prior  performance.  Fig.  7  shows  the 
receiver  operating  characteristic  (ROC)  curve  trading  off  between  FAR  and  FRR  by  varying  the 
threshold  parameter  t  in  (3). 
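Equations (3) and (4) are not reproduced in this section, so the following is only a sketch of one standard way to weigh binary local decisions by their prior error rates: log-likelihood-ratio weights in the style of the Chair-Varshney rule. Whether it matches (4) term by term is an assumption; the function name `fuse`, the bias term `a0`, and the sample rates are hypothetical.

```python
import math


def fuse(decisions, far, frr, a0=0.0):
    """Fuse whichever local decisions are available in the current window.

    decisions -- dict: modality -> +1 (accept) or -1 (reject); a modality
                 that did not fire in the window is simply absent
    far, frr  -- dicts: modality -> rate estimated during characterization,
                 each strictly between 0 and 1
    a0        -- bias term; varying it trades FAR against FRR
    Returns +1 (accept) or -1 (reject).
    """
    score = a0
    for modality, u in decisions.items():
        if u == 1:
            # log P(accept | genuine) / P(accept | impostor)
            score += math.log((1.0 - frr[modality]) / far[modality])
        else:
            # log P(reject | genuine) / P(reject | impostor), a negative term
            score -= math.log((1.0 - far[modality]) / frr[modality])
    return 1 if score > 0 else -1


# Illustrative rates only, loosely based on the magnitudes quoted in §5.2.
far = {"location": 0.10, "text": 0.16}
frr = {"location": 0.05, "text": 0.11}

# Location accepts, text rejects: the more reliable classifier wins.
result = fuse({"location": 1, "text": -1}, far, frr)
print(result)
```

Absent modalities contribute nothing to the sum, which is what makes a rule of this form tolerant of the irregular firing rates described above.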

As  the  size  of  the  decision  window  increases,  the  performance  of  the  fusion  system  improves, 
dropping  from  an  equal  error  rate  (EER)  of  0.05  using  the  1  minute  window  to  below  0.01  EER 
using  the  30  minute  window. 
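For reference, an EER can be approximated from a set of fused scores by sweeping the decision threshold and taking the point where FAR and FRR are closest. The sketch below assumes higher scores mean "more likely the valid user"; the function name and the sample scores are hypothetical.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A window is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the threshold where |FAR - FRR| is smallest;
    the reported EER is the mean of FAR and FRR at that threshold.
    """
    best = None
    for thr in sorted(set(genuine_scores) | set(impostor_scores)):
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Made-up scores for four genuine and four impostor windows.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)
```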


5.4  Contribution  of  Local  Classifiers  to  Global  Decision 


The performance of the fusion system that utilizes all four modalities of text, app, web, and
location is described in the previous section. In addition, we can use the fusion system to
characterize the contribution of each of the local classifiers to the global decision. This is the
central question we consider in this report: what biometric modality is most helpful in verifying
a person’s identity under the constraint of a specific time window before the verification decision
must be made? We measure the contribution $C_i$ of each of the four classifiers by evaluating the
performance of the system with and without the classifier, and computing the contribution by:

$$C_i = \frac{E_i - E}{E_i} \tag{7}$$

where $E$ is the error rate computed by averaging FAR and FRR of the fusion system using the
full portfolio of 4 classifiers, $E_i$ is the error rate of the fusion system using all but the i-th
classifier, and $C_i$ is the relative contribution of the i-th classifier as shown in Fig. 8. We
consider the contribution of each classifier under three time windows of 1 minute, 10 minutes,
and 30 minutes. Location contributes the most in all three cases, with the second biggest
contributor being web browsing. Text contributes the least for the small window of 1 minute, but
improves for the larger windows. App usage is the least predictable contributor.
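Equation (7) amounts to a leave-one-out comparison and can be computed as in the sketch below. The error rates used here are invented placeholders attached to generic classifier names; they are not results from this report, and the function names are hypothetical.

```python
def mean_error(far, frr):
    """Error rate E, defined as the average of FAR and FRR."""
    return (far + frr) / 2.0


def relative_contribution(error_full, error_without):
    """Apply equation (7): C_i = (E_i - E) / E_i for every classifier i.

    error_full    -- E, error of the fusion system with all classifiers
    error_without -- dict: classifier name -> E_i, error with it left out
    """
    return {name: (e_i - error_full) / e_i
            for name, e_i in error_without.items()}


# Placeholder numbers for three generic classifiers A, B, and C.
E = mean_error(far=0.03, frr=0.01)
contributions = relative_contribution(E, {"A": 0.08, "B": 0.04, "C": 0.025})
for name, c in contributions.items():
    print(f"{name}: {c:.2f}")
```

A value near 1 means that removing the classifier makes the error much larger than the full system's error; a value near 0 means the system performs about as well without it.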
Figure 5: FAR and FRR performance of the individual classifiers associated with each of the
four modalities.

Each bar represents the average error rate for a given module and time window. Each of the 200
users has 2 classifiers for each modality, so each bar provides a value that was averaged over 200
individual error rates. The error bars indicate the standard deviation across these 200 values.
Figure 6: The distribution of the number of events that fire within a given time window (x-axis:
number of events fired in time window).

This is a long-tailed distribution: non-zero probabilities of event frequencies above 13 extend to
over 100. These outliers are excluded from this histogram plot in order to highlight the
high-probability frequencies. Time windows in which no events fire are not included in this plot.
Figure 7: The performance of the fusion system with 4 classifiers on the 200 subject dataset
(axes: False Accept Rate (FAR) and False Reject Rate (FRR)).

The ROC curve shows the tradeoff between FAR and FRR achieved by varying the threshold
parameter $a_0$ in (4).
Figure 8: Relative contribution of each of the 4 classifiers computed according to (7).

6  Conclusions


In this work, we proposed a parallel binary decision-level fusion architecture for classifiers based
on four biometric modalities: text, application usage, web browsing, and location. Using this
fusion method, we addressed the problem of active authentication and characterized its
performance on a real-world dataset of 200 subjects, each using their personal Android mobile
device for a period of at least 30 days. The authentication system achieved an equal error rate
(EER) of 0.05 (5%) after 1 minute of user interaction with the device, and an EER of 0.01 (1%)
after 30 minutes. We showed the performance of each individual classifier and its contribution to
the fused global decision. The location-based classifier, while having the lowest firing rate,
contributes the most to the performance of the fusion system.


7  Disclaimer 

This  research  was  developed  with  funding  from  the  Defense  Advanced  Research  Projects 
Agency  (DARPA).  The  views,  opinions,  and/or  findings  contained  in  this  article  are  those  of  the 
authors  and  should  not  be  interpreted  as  representing  the  official  views  or  policies  of  the 
Department  of  Defense  or  the  U.S.  Government. 